Method and system for an aggregate web site search database

ABSTRACT

Signature schema documents may be pre-defined using a query language to provide instructions for application by an engine to extract data from web pages of respective web sites. For a particular web page, signature schema instructions identify a web page family for the web page and extract desired data from the web page in accordance with its web page family. The instructions use signatures previously identified within web pages of the same family to distinguish the web page family from others of the web site and to distinguish the desired data from other data for the web page family. A server may make one or more requests to obtain web pages from various web sites and apply respective signature schemas maintained in a repository coupled to the engine. Extracted data can be stored to an aggregate database.

CROSS REFERENCE

This application claims the benefit of the prior filing of U.S. Provisional Patent Application Ser. No. 60/924503 filed May 17, 2007, the disclosure of which is incorporated herein by reference.

COPYRIGHT

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights.

FIELD

The present application relates generally to telecommunications and more particularly to a system and method for an aggregate web site search database.

BACKGROUND

Web sites host and provide information using web pages that are communicated electronically via a telecommunications network. Accessing this information by some client computing devices can be challenging. Computing devices are becoming smaller and increasingly utilize wireless connectivity. Examples of such computing devices include portable computing devices that include wireless network browsing capability as well as telephony and personal information management capabilities.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic representation of a system for content navigation.

FIG. 2 is a schematic representation of a wireless communication device from FIG. 1.

FIG. 3 illustrates a flow of interactions among components of the system of FIG. 1.

FIG. 4 is a schematic representation of an aggregate web site search database for e-commerce.

FIG. 5 is a schematic representation of tables in an aggregate web site search database for e-commerce.

FIG. 6 illustrates a method of creating the aggregate web site search database.

FIG. 7 illustrates a method of querying the aggregate web site search database.

FIGS. 8A-8D and 9A-9D respectively illustrate representative web pages rendered on a first browser window and portions of said representative web pages transcoded and rendered on a second browser window in accordance with an embodiment.

DETAILED DESCRIPTION OF THE EMBODIMENTS

The smaller size of such client devices necessarily limits their display capabilities. Furthermore, the wireless connections to such devices typically have less, or more expensive, bandwidth than corresponding wired connections. The Wireless Application Protocol (“WAP”) was designed to address such issues, but WAP can still provide a very unsatisfactory or even completely ineffective experience, particularly where the small client device needs to effect a connection with web sites that host web pages that are directed to traditional full desktop browsers. In addition, accessing data from multiple web sites concurrently and extracting relevant data can be difficult and time consuming.

Signature schema documents may be pre-defined using a query language, to provide instructions for application by an engine to extract data from web pages of respective web sites for storage to an aggregate database. For a particular web page, signature schema instructions identify a web page family for the web page and extract desired data from the web page in accordance with its web page family. The instructions use signatures previously identified within web pages of the same family to distinguish the web page family (e.g. in accordance with a shared template for each family) from others of the web site and to distinguish the desired data from other data for the web page family. A server may make one or more requests to obtain web pages from various web sites and apply respective signature schemas maintained in a repository coupled to the engine. Extracted desired data may be stored to a database coupled to the engine to facilitate querying of the data and enable aggregate results to be presented to a client machine (e.g. a wireless communication device) or enable regeneration of original web pages based upon the signature schema.

In the present disclosure there is provided a method of aggregating web site data from one or more web sites, the method comprising: sending a page request to a web site selected from the one or more web sites; receiving the requested web page from the selected web site; retrieving signature schema associated with the requested web page wherein the signature schema identifies data fields within the requested web page; applying signature schema to the requested web page to extract data from the requested web page; and storing extracted data to an aggregate database, wherein the aggregate database comprises data extracted from the one or more web sites.

In the present disclosure there is provided a system for aggregating web site data from one or more web sites, the system comprising: at least one computing device comprising a processor and a memory coupled thereto, said memory storing instructions and data for configuring the processor to: send a page request to a web site selected from the one or more web sites; receive a web page from the selected web site based upon the sent page request; retrieve signature schema associated with the requested web page; apply signature schema to the requested web page data to extract data identified by the signature schema; and store extracted data to an aggregate database comprising data extracted from the one or more web sites.

In the present disclosure there is provided a computer program product storing computer readable instructions which when executed by a computer processor configure the processor for: sending a page request to a web site selected from the one or more web sites; receiving the requested web page from the selected web site; retrieving signature schema associated with the requested web page wherein the signature schema identifies data fields within the requested web page; applying signature schema to the requested web page to extract data from the requested web page; and storing extracted data to an aggregate database, wherein the aggregate database comprises data extracted from the one or more web sites.

In the present disclosure there is provided a method of aggregating web site data from one or more web sites, the method comprising: sending a page request to a web site selected from the one or more web sites; receiving the requested web page from the selected web site; retrieving signature schema associated with the requested web page wherein the signature schema identifies data fields within the requested web page and wherein the signature schema are eXtensible Markup Language (XML) documents comprising query language for extracting data from the requested web page; applying signature schema to the received web page to extract data from the requested web page; storing extracted data to an aggregate database, wherein the aggregate database comprises data extracted from the one or more web sites; receiving a search query from a client machine for data stored in the aggregate database; generating a database query based upon the received search query; and retrieving data from the aggregate database defined by the query. The client machine may be a wireless device.

Referring now to FIG. 1, there is illustrated a system 100 for content navigation via a telecommunications network. In a present embodiment system 100 comprises one or more client computing devices in the form of client machines 102A and 102B (collectively 102), web site servers 106, 107 and 109 that respectively host web sites 104, 103 and 105, and a gateway and schema server 120. Machines 102 are respectively coupled to communicate with gateway and schema server 120 to obtain web pages (e.g. 110) transcoded from web sites 103, 104 and 105 and to access aggregate data from the web sites through web server 125 hosting web site 150.

In the present embodiment, web sites 103, 104 and 105 contain data that is to be aggregated into database 126. For example, web site 104 comprises a web server 106 serving web pages (e.g. 110) defined from one or more web page family templates 108A-108D (collectively 108) and web page content (described further herein below) from data store 112. In the present embodiment of system 100, gateway and schema server 120 is coupled to a schema repository 124 from which to obtain a signature schema 122 for a particular web site. Signature schema documents (e.g. 122) provide instructions and data with which an engine 140 of server 120 can extract data from web pages (e.g. 110) and transcode same to a target format to provide transcoded web page data (e.g. 130 and 132) to the respective requesting client machines 102A and 102B as described more fully below. Gateway and schema server 120 may also be coupled to a database 126 for retrieving/storing data extracted from web sites in accordance with its operations. The database 126 may be a relational database for storing extracted data objects and elements and their relationships from web sites in relation to the defined signature schema. The stored data can be accessed using Structured Query Language (SQL) queries to retrieve desired data from database 126. Signature schemas for respective web sites may be defined (e.g. coded) using a computing device 128 as described herein below. A web server 125 coupled to the aggregate web site database 126 enables access to the aggregated data of database 126 by a web site 150. The web server 125 can also provide a data collection engine 152, or web crawler, for sending requests to web sites 103, 104 and 105 for desired pages and providing content to schema engine 140 for processing.

Representative client machines 102 include any type of computing or electronic device that can be used to communicate and interact with content available via web sites. Each of the client machines 102 may be operated by a respective user U (not shown). Interaction with a particular user includes presenting information on a client machine (e.g. by rendering on a display screen) as well as receiving input at a client machine (e.g. via a keyboard for transmitting to a web site). In the present embodiment, client machine 102A comprises a mobile or wireless electronic device with the combined functionality of a personal digital assistant, cell phone, email paging device, and a web-browser. Such a mobile electronic device may comprise a keyboard (or other input device(s)), a display screen, a speaker (and other output device(s), e.g. LEDs) and a chassis for housing such components. The chassis may further house one or more central processing units, volatile memory (e.g. random access memory), persistent memory (e.g. Flash read only memory) and network interfaces to allow client machine 102A to communicate over the telecommunication network.

Referring now to FIG. 2, a schematic block diagram shows an exemplary client machine 102A in greater detail. It should be emphasized that the structure in FIG. 2 is purely exemplary, and contemplates a device that may be used for both wireless voice (e.g. telephony) and wireless data (e.g. email, web browsing, text) communications. Client machine 102A includes one or more input devices which in a present embodiment include a keyboard and, typically, additional input buttons, collectively 200, an optional pointing device 202 (e.g. a trackball or trackwheel) and a microphone 204. Other input devices, such as a touch screen and camera lens, are also contemplated. Input from keyboard/buttons 200, pointing device 202 and microphone 204 may be received at a processor 208. Processor 208 may be further operatively coupled with a non-volatile storage unit 212 (e.g. read only memory (“ROM”), Erasable Electronic Programmable Read Only Memory (“EEPROM”), or Flash Memory) and a volatile storage unit 216 (e.g. random access memory (“RAM”)), speaker 220, display screen 224 and one or more lights (LEDs 222). Processor 208 may be operatively coupled for network communications via a subsystem 226. Wireless communications are effected via at least one radio (e.g. 228) such as for Wi-Fi or cellular wireless communications. Client machine 102A also may be configured for wired communications such as via a USB or other port and for short range wireless communications such as via a Bluetooth® radio (all not shown).

Programming instructions that implement the functional teachings of client machine 102A as described herein are typically maintained, persistently, in non-volatile storage unit 212 and used by processor 208 which makes appropriate utilization of volatile storage 216 during the execution of such programming instructions. Of particular note is that non-volatile storage unit 212 persistently maintains a web browser application 86 and, in the present embodiment, a native menu application 82, each of which can be executed on processor 208 making use of volatile storage 216 as appropriate. An operating system and various other applications (not shown) are maintained in non-volatile storage unit 212 according to the desired configuration and functioning of client machine 102A, one specific non-limiting example of which is a contact manager application (also known as an address book, not shown) which stores a list of contacts, addresses and phone numbers of interest to user U and allows user U to view, update, and delete those contacts, as well as providing user U an option to initiate telecommunications (e.g. telephone, email, instant message (IM), short message service (SMS)) directly from that contact manager application.

Native menu application 82 may be configured to provide menu choices to user U according to the particular application (or other context) that is being accessed. By way of example, while user U is activating the contact manager application, user U can activate menu application 82 to access one or more menu choices available that are respective to contact manager application 90. For example, menu choices may include options to invoke other applications (e.g. a mapping application to map a contact's address) or communication functions (e.g. call, SMS, IM, email, etc.) on the client machine 102A for a particular contact. Menu application 82 may be associated to a particular input button (e.g. one of buttons 200) and invoked to provide a contextual menu comprised of one or more menu choices that are reflective of the context in which the button 200 was selected. Note that the options in a contextual menu are stored within non-volatile storage 212 as being specifically associated with a respective application. Menu application 82 may therefore be configured to generate one or more different contextual menus that are reflective of the particular context in which the menu application 82 is invoked. For example, in an email application where an email is being composed, invoking menu application 82 would generate a contextual menu that included the options of sending the email, cancelling the email, adding addresses to the email, adding attachments, and the like. The contents for such a contextual menu would also be maintained in non-volatile storage 212. Other examples of contextual menus will occur to those of ordinary skill in the art.

FIGS. 8A-8D and 9A-9D respectively illustrate representative web pages rendered on a first browser window and portions of a subset of data from said representative web pages transcoded and rendered on a second browser window in accordance with an embodiment. FIG. 8A illustrates a representative home web page 860A of an e-commerce web site (e.g. 104) in a browser window 850. Window 850 is illustrative of a rendering to a large size display device (e.g. desktop monitor). Web page 860A comprises, among other things, a menu portion 852 and a primary content display portion 854, in the example, showing various advertisements 855 for products. FIG. 9A illustrates the menu portion 852 extracted and transcoded and rendered as a web page on a second browser window 950. Window 950 is illustrative of a rendering to a small size display device (e.g. of a wireless mobile device). In addition to transcoding as a web page, menu portion 852 may be transcoded for menu application 82, e.g. for invocation when browsing the site 104 as referenced further herein.

FIG. 8B illustrates an exemplary product web page 860B in window 850 showing various product data (collectively 866) including image 866A, price 866B, title 866C and description 866D data that is transcoded and shown in window 950 of FIG. 9B. Also transcoded is the web page hierarchy list 868 showing where the page is on the web site.

FIG. 8C illustrates an exemplary product list web page 860C in window 850 showing a list of products (collectively 870). A subset of the product data such as image 870A, price 870B, and title 870C is transcoded and shown in window 950 of FIG. 9C. Note that multiple pages 872 may be provided for the list 870.

FIG. 8D illustrates an exemplary account checkout web page 860D in window 850 showing a login form 880 for receiving account login and password, which form is transcoded and shown in window 950 of FIG. 9D. Though not shown, other checkout pages (e.g. for payment or order confirmation, etc.), search pages, product and information pages may be similarly transcoded.

Returning now to FIG. 1, web servers 106, 107 and 109 and gateway and schema server 120 (which can, if desired, be implemented on a single server) can be based on any commonly available server environments or platforms including a module that houses one or more central processing units, volatile memory (e.g. random access memory), persistent memory (e.g. hard disk devices) and network interfaces to allow servers 106, 107, 109 and 120 to communicate over the telecommunications network. Web servers host software applications comprising instructions and data for generating and serving web pages dynamically from the template families 108 and current informational content therefor from data store 112. Load balancing, security/firewall, billing, account and other applications may also be present as is well-known in the art.

Gateway and schema server 120 hosts software applications comprising instructions and data for proxying requests and responses between the client machines 102 and web sites 103, 104 and 105. In addition to software for maintaining HTTP communications, performing requests, maintaining sessions, handling cookies, etc., engine 140 may be implemented in software to apply the signature schemas to web pages from web sites. There may be provided an interpreter that interprets the signature schema document and applies the actions against the web page code (as ASCII (plain text)) to extract desired data to produce a result set. A renderer may be provided to express the desired data result set (i.e. transcode to a target format such as cHTML (Compact HTML) for a mobile device browser) for transmitting to the client machines also in accordance with the signature schema.

The web server 125 provides web pages to a requesting client machine through a browser or application on the client for rendering. The web data may be directly pushed to client machine 102A by e-mail or by other push based applications, or the data may be accessed by queries to web site 150 directly. The web site 150 may also extract content from the aggregate database 126 and apply signature schema 122 to the extracted database data, which schema may be configured to transcode the data in accordance with the target client machine 102A to tailor the output result.

Machines 102, schema server 120, and web sites 103, 104, 105 and 125 are coupled via a telecommunication network (not shown) typically comprising one or more interconnected networks that may include wired and (at least for machine 102A) wireless networks. It should now be understood that the nature of the network is not particularly limited and is, in general, based on any combination of architectures that will support interactions between client machines 102 and servers 106, 107, 109, 125 and 120. In a present embodiment the network includes the Internet as well as appropriate gateways and backhauls.

More specifically, in the present embodiment, a wireless network for client machine 102A may be based on core mobile network infrastructure (e.g. Global System for Mobile communications (“GSM”), Code Division Multiple Access (“CDMA”), Enhanced Data rates for GSM Evolution (“EDGE”), Evolution Data-Optimized (“EV-DO”), High Speed Downlink Packet Access (“HSDPA”), Universal Mobile Telecommunications System (“UMTS”), etc.) or on wireless local area network (“WLAN”) infrastructures such as the Institute of Electrical and Electronics Engineers (“IEEE”) 802.11 Standard (and its variants) or Bluetooth or the like or hybrids thereof. In the present embodiment of system 100 it is contemplated that client machine 102B may be another type of client machine such as a PC (desktop or laptop) configured to include a full desktop computer or as a “thin-client”. Typically such machines have larger display monitors/screens than portable machines like 102A. A wired network for system 100 and machine 102B can be based on a T1, T3 or any other suitable wired connection.

As previously stated in relation to FIGS. 1 and 2, each of the client machines 102 is configured to interact with content available over the network, including web pages on web site 104. In a present embodiment, client machines 102A and 102B may navigate for content using a browser application (e.g. 86). As will be explained further below, on client machine 102A, browser application 86 may be a mini-browser in the sense that it may be configured to render web pages on the relatively small display 224 of client machine 102A. Often, during such rendering, those pages are presented in a format that may be different from how those pages are rendered on a traditional desktop browser application (e.g. browser 86 of client machine 102B). Mini-browsers typically attempt to convey substantially the same information as if the web pages had been rendered on a full browser such as Internet Explorer®, Safari® or Firefox® on a traditional desktop or laptop computer like client machine 102B.

FIG. 3 is a flowchart illustrating operations/interactions for generating and maintaining an aggregate database from web sites 103-105 for populating and updating database 126. The flowchart provides an example of the interaction among the gateway and schema server 120/140 and data collection engine 152, with web servers 106, 107 and 109 hosting web sites 103, 104 and 105 to generate and maintain the aggregate database 126. The data collection engine 152 (DCE) makes a request 302 to the web site's web server (for example web server 106) for the specified pages based upon the type of data to be aggregated. The web page code (e.g. 110) is generated by server 106 and sent 306 to DCE 152. The web page code received is a text file. It typically does not include objects referenced by the code such as images, video, audio, further web pages, etc. that are typically subsequently retrieved and inserted at the time of rendering a web page by a browser. The schema server engine 140 (SSE) (for example, in parallel or without waiting for a response from server 106) makes a request 304 to the signature repository 124 for the signature schema document 122 for the web site, which request may use the domain in the URL as an identifier for obtaining the document 122. The schema server engine 140 receives 308 the schema and does not render the web page 110 per se but instead uses the instructions in the signature schema document 122 to extract the desired data from the web page 110. The signature schema 122 is configured to extract data from web page 110 in accordance with the specific desired content characteristics for the database 126, which are based upon the target client machine 102A, with knowledge of display 224 capabilities (such as screen size, resolution, and other parameters) useful in determining the way in which the transcoded data is to be displayed on the machine 102A. The web page 110 or extracted data is stored 310 in database 126 in a relational database structure, where only data related to the defined signature schema from the web site is stored.
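The request, schema lookup, extraction and storage steps of FIG. 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the in-memory SCHEMA_REPOSITORY and FAKE_WEB dictionaries, the start/end marker form of the schema, the table layout and all function names are hypothetical stand-ins for schema repository 124, web server 106, signature schema 122 and database 126.

```python
import sqlite3
from urllib.parse import urlparse

# Hypothetical stand-in for schema repository 124: one schema per domain,
# each field described by a (start marker, end marker) pair.
SCHEMA_REPOSITORY = {
    "eshop.ca": {
        "title": ("<h1>", "</h1>"),
        "price": ('<td class="price">', "</td>"),
    },
}

# Hypothetical stand-in for the web page code a web server would return.
FAKE_WEB = {
    "http://www.eshop.ca/item/1":
        '<html><h1>Kettle</h1><td class="price">$19.99</td></html>',
}


def schema_key(url):
    """Use the domain in the URL as the identifier for the schema document."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def fetch_page(url):
    """Stand-in for the data collection engine's page request and response."""
    return FAKE_WEB[url]


def apply_schema(schema, page):
    """Extract each field as the text between its start and end markers."""
    record = {}
    for field, (start, end) in schema.items():
        begin = page.find(start)
        if begin == -1:
            continue  # signature not present on this page
        begin += len(start)
        finish = page.find(end, begin)
        if finish != -1:
            record[field] = page[begin:finish]
    return record


def aggregate(urls, conn):
    """Fetch each page, apply its site's schema, store only extracted data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items "
        "(site TEXT, url TEXT, field TEXT, value TEXT)"
    )
    for url in urls:
        key = schema_key(url)
        page = fetch_page(url)
        for field, value in apply_schema(SCHEMA_REPOSITORY[key], page).items():
            conn.execute(
                "INSERT INTO items VALUES (?, ?, ?, ?)", (key, url, field, value)
            )
    conn.commit()


conn = sqlite3.connect(":memory:")
aggregate(["http://www.eshop.ca/item/1"], conn)
rows = conn.execute("SELECT field, value FROM items ORDER BY field").fetchall()
print(rows)
```

Only the fields named by the schema reach the table, which mirrors the statement that only data related to the defined signature schema is stored.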

Client machine 102A can then make a request 312 to web site 150 on server 125 for a query to the database 126 regarding desired web sites having a specific domain (URL). Web site 150 requests 314 relevant data from the database 126. The results are extracted 316 and then sent 318 to the requesting client machine 102A, either as aggregate results or by proxy as if the query had been made directly to the source web site. The results are transcoded in accordance with the schema 122 by the signature schema engine 140 before presentation by web server 125. Alternatively, the data may be pushed to a push based application on a client machine. As noted above, transcoded data 130 may comprise transcoded navigational data for menu application 82 and informational content data (e.g. a list of products and related information from a web page) for displaying by browser application 86. The process can then be repeated for each identified web site such as web sites 103 and 105.
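The step of generating a database query from a received search query may be illustrated with a short sketch. The products table, its columns, the dictionary form of the client search and the build_query helper are all hypothetical; the actual tables of database 126 are those of FIG. 5, which this sketch does not attempt to reproduce.

```python
import sqlite3


def build_query(search):
    """Translate a client search (hypothetical dict form) into SQL and parameters."""
    sql = "SELECT site, title, price FROM products WHERE title LIKE ?"
    params = ["%" + search["keywords"] + "%"]
    if search.get("site"):
        # Optionally restrict results to one source domain.
        sql += " AND site = ?"
        params.append(search["site"])
    return sql + " ORDER BY price", params


# Hypothetical aggregate data extracted from two web sites.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (site TEXT, title TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        ("eshop.ca", "Steel kettle", 19.99),
        ("othershop.example", "Glass kettle", 24.50),
        ("eshop.ca", "Toaster", 29.00),
    ],
)

# Aggregate results across all sites.
sql, params = build_query({"keywords": "kettle"})
all_sites = conn.execute(sql, params).fetchall()

# Results limited to a specific domain.
sql, params = build_query({"keywords": "kettle", "site": "eshop.ca"})
one_site = conn.execute(sql, params).fetchall()
print(all_sites, one_site)
```

Passing user-supplied values as bound parameters rather than concatenating them into the SQL text is a design choice of the sketch, made to avoid injection of query text from the client.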

Signature schemas are pre-defined documents, and may be eXtensible Markup Language (XML) documents utilizing an SQL-like query language, to incorporate instructions and data with which to intelligently extract the data from web pages (which web pages are typically coded in HTML, DHTML, XHTML, XML, RSS, Javascript, etc.). This extracted data may be transcoded and provided to client machines 102, used to dynamically generate a relational database (e.g. 126) or both. Each signature schema incorporates an understanding of a particular web site's data including relationships among the various data (e.g. among its primary informational content found in the body of its web pages as well as among such content and associated navigational data (e.g. web page links) that govern the data in the page). As described further herein below, prior knowledge of the web page code including specific identifiers, tags and text (i.e. strings) used within the code (sometimes referred to as “signatures” herein), may be used to define instructions to identify portions of the code of interest and to extract specific desired data.

In accordance with the present embodiment, a signature schema document may be defined for all the pages of a particular web site. Large data-driven web sites (e.g. 104) don't maintain thousands of individual web pages per se. The sites typically adopt a few page family templates 108 and dynamically populate these with pertinent content from database 112 comprising information (e.g. weather, stock data, news, shopping/product data, patent data, trade-mark data etc.) as applicable when a client requests a particular page. Each template represents a family of pages having objects and attributes. Below are representative example page family templates and their objects and attributes for a web site offering news and an e-commerce web site offering products for sale electronically:

EXAMPLE 1 News Site

-   Family: List Page
-   Objects: lists a selection of news stories
-   Attributes: Title, abstract and date
-   Family: Detail page
-   Objects: lists a single news story (and optionally other related stories)
-   Attributes: Journalist, City, Date, Title, Full Story, Image

EXAMPLE 2 E-Commerce Site

-   Family: List Page
-   Objects: lists a selection of products
-   Attributes: Image, Item Name, Price, Sale Price
-   Family: Search Page (a specific kind of list page)
-   Objects: Similar to a list page
-   Attributes: Similar to a list page

Each family of pages (the family template) can be identified by a “signature” or unique set of one or more features that automatically identifies a given page on a web site as part of the family and differentiates that family from another family of pages. Similarly each object and attribute field of interest can be identified with its respective unique signature within a family of pages. A signature schema document typically comprises numerous pieces of information (commands), for example, information that instructs the engine 140 for:

    -   identifying all page families;
    -   identifying and extracting data (i.e. desired objects and attributes) for each page family;
    -   capturing the (implicit or explicit) relationships between the objects and attributes; and
    -   transcoding the data.

A signature schema document may also be configured to enable special functionality for the target web site including searching, logging in a user, purchasing items, etc.

In accordance with a present embodiment, the structure and syntax of a representative signature schema document for a representative e-commerce site eshop.ca is shown and described. Engine 140 may be configured to receive web page code comprising text data and search through the text in accordance with the schema document instructions that provide SQL-query like language instructions. Engine 140 maintains a pointer within the text as it moves through the web page code performing various actions, as described below, in accordance with the schema instructions. Table 1 illustrates a snippet of a representative signature schema:

TABLE 1 XML Signature Schema Snippet for E-Shop.ca

1 <?xml version="1.0" encoding="ISO-8859-1" ?>
2 <site>
3   <version major="1" minor="2"/>
4   <url location="http://www.eshop.ca" key="eshop.ca" name="E-Shop" />
5   <advanced>
6
7     <index_link value="http://www.eshop.ca/home.asp" />
8   </advanced>
9   <page_type>
10     <lookup type="pex" action="locate_string" name="list_elements" id="mylist_1" ref="Compare products" alt1="Sort products" />
11     <lookup type="pex" action="locate_string" name="item_elements" id="myitem_1" ref="&quot;product-details&quot;" />
12     <lookup type="pex" action="locate_string" name="menu_elements" id="mymenu_2" ref="anc-lhsnav-subItem" />
13     <lookup type="pex" action="locate_string" name="menu_elements" id="mymenu_1" ref="product-table" />
14     <lookup type="pex" action="locate_string" name="item_elements" id="myitem_1" ref="*" />
15   </page_type>
16   <list_elements id="mylist_1"> ...
17   </list_elements> ...
18   <item_elements id="myitem_1">
19     <actions>
20       <lookup type="pex" action="move_ptr" ref="&lt;/head&gt;" />
21     </actions>
22     <element>
23       <lookup type="pex" action="get_string" name="image" ref="largeimageref" location="after" start="&lt;img src=&quot;" end="&quot;" />
24       <lookup type="pex" action="get_string" name="title" ref="product-details-prd-title" location="after" start="&lt;span" end="&lt;/span&gt;" include_sz="1" strip_tags="1" />
25       <lookup type="pex" action="get_string" name="price" ref="our price:" location="after" start="&lt;td" end="&lt;/td&gt;" include_sz="1" strip_tags="1" />
26       <lookup type="pex" action="get_string" name="sale_price" ref="sale price:" location="after" start="&lt;td" end="&lt;/td&gt;" include_sz="1" strip_tags="1" tolerance="1" />
27       <lookup type="pex" action="get_string" name="description" ref="detailbox-text" location="middle" start="&lt;p" end="&lt;/p&gt;" include_sz="1" strip_tags="1" />
28     </element>
29   </item_elements> ...

In the XML code snippet of Table 1, instructions at line 4 are for verifying that the web page under consideration and the signature schema relate to the same web site/domain, eshop.ca. Instructions at lines 9-15 are for determining the particular page family to which the web page under consideration belongs. A respective signature that defines the particular page family has been previously identified for use to distinguish the page. The engine 140 processes the <page_type> tag by registering the identification strings for each page family. When a web page is obtained by the engine as input, the engine may be able to identify the page family by its unique string ref="" and the command provides the related tag within the signature schema document where further instructions for the particular web pages are found:

-   action="locate_string": command to check for the existence of a string.
-   name="": identifies the type of page family for each identified family.
-   id="": assigns an id to the page family that is used across the signature schema document.

For example, at line 10, the instructions identify a web page using the alternative signatures "Compare products" or "Sort products". Web pages with these strings are of the same family type. The instructions at line 10 provide a reference tag to further instructions for this family, providing a link to instructions for the list_elements page family with an ID of mylist_1 (see lines 16-17). Similarly, the other lookup instructions provide references to the specific instructions within the signature schema document for handling a web page of each web page family. Representative instructions for some of the web page families are provided in Table 1, for example, at lines 16-17 and 18-29, with others omitted for brevity.
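The page family lookup described above can be sketched in Python. This is a minimal illustration only, not the engine 140 itself: the `PAGE_TYPE_LOOKUPS` list and the `identify_family` function are hypothetical names, the lookups are transcribed by hand from lines 10-13 of Table 1, and the wildcard lookup at line 14 is left out because its matching rules are not spelled out here.

```python
from typing import Optional

# Hypothetical in-memory form of the <page_type> lookups in Table 1 (lines 10-13).
# Each entry is (family name, section id, alternative signature strings).
PAGE_TYPE_LOOKUPS = [
    ("list_elements", "mylist_1", ["Compare products", "Sort products"]),
    ("item_elements", "myitem_1", ['"product-details"']),
    ("menu_elements", "mymenu_2", ["anc-lhsnav-subItem"]),
    ("menu_elements", "mymenu_1", ["product-table"]),
]


def identify_family(page_text: str) -> Optional[tuple[str, str]]:
    """Return (family name, section id) for the first lookup whose signature occurs."""
    for name, section_id, signatures in PAGE_TYPE_LOOKUPS:
        # Alternatives are OR-ed: any one of them identifies the family.
        if any(signature in page_text for signature in signatures):
            return name, section_id
    return None  # no registered signature was found
```

The returned section id (for example `mylist_1`) is what would then be used to find the matching extraction instructions elsewhere in the signature schema document.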

With reference to the extraction instructions for one of the web page families (e.g. item_elements id="myitem_1") at lines 18-29, the instruction at line 20 advances the scan pointer within the text file of the web page code to a beginning limit of a region of interest indicated by a signature reference. This establishes an upper limit for review within the text file. Though not shown in this table, an end limit may be defined as well (see Table 4). Further instructions at lines 22-28 may comprise commands to locate desired data using "signatures" such as string identifiers that uniquely identify the data within the region of interest. In the present example, the instructions locate and extract one or more elements, namely, product image, title, price, sale price and description for a product of the item web page family. For example, the instructions at line 23 extract the string between the first "&lt;img src=&quot;" and "&quot;" that appears after the next appearance of "largeimageref". The string returned is the path (relative URL at web site eshop.ca) to the product image. By advancing a search scan pointer within the web code to a desired location, references before that location can be skipped when searching. Any prior instances of a signature string such as "largeimageref" may be ignored. In this way, otherwise ambiguous signature references can be avoided.

The example in Table 1 shows at least some of the instructions (e.g. lines 23-27) including one or more directional references relative to the signatures to locate and extract the desired data. For example, directional references such as "before" or "after" command the engine to extract desired data that is in a relative position in the web page before or after the signature string (i.e. ref=). Moreover, such instructions may further include at least one of a start reference or an end reference further pinpointing the location of the desired data in accordance with that direction. Additional directional reference information is discussed herein with reference to code snippets in other Tables and the discussion of an embodiment of signature transcoding engine syntax presented below.
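As a rough sketch of how a scan pointer and an "after" lookup might interact, the Python below mimics lines 20 and 23 of Table 1. The function names, the sample markup in `PAGE` and the exact matching rules are assumptions made for illustration; the strings are shown already unescaped (e.g. `<img src="` rather than `&lt;img src=&quot;`).

```python
from typing import Optional


def move_ptr(data: str, ref: str) -> str:
    """Drop everything before the first occurrence of ref (unchanged if absent)."""
    idx = data.find(ref)
    return data if idx < 0 else data[idx:]


def get_string_after(data: str, ref: str, start: str, end: str) -> Optional[str]:
    """Return the text between start and end, both searched for after ref."""
    ref_idx = data.find(ref)
    if ref_idx < 0:
        return None
    start_idx = data.find(start, ref_idx + len(ref))
    if start_idx < 0:
        return None
    value_from = start_idx + len(start)
    end_idx = data.find(end, value_from)
    return None if end_idx < 0 else data[value_from:end_idx]


# Hypothetical page: the signature "largeimageref" also occurs inside <head>.
PAGE = (
    '<head><!-- largeimageref <img src="/img/placeholder.gif"> --></head>'
    '<body><div id="largeimageref"><img src="/img/tv.jpg" alt="TV"></div></body>'
)

# Without moving the pointer, the earlier (ambiguous) occurrence wins.
ambiguous = get_string_after(PAGE, "largeimageref", '<img src="', '"')
# After moving the pointer to </head>, only the body occurrence is seen.
image = get_string_after(move_ptr(PAGE, "</head>"), "largeimageref", '<img src="', '"')
```

In this sketch `ambiguous` is the placeholder path and `image` is the product image path, which is the reason for advancing the pointer before searching.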

The example within Table 1 demonstrates the extraction of data and the establishment of relationships between objects and elements within a same page of a web site. However, signature schema documents may further capture relevant attributes of an object across pages. For example, a user of client machine 102A may click through a number of web pages in eshop.ca to get to a specific product page (e.g. Department→Product Category→Product Sub-Category→Specific Product, such as TV & Video>19″-21″ TVs>LCD TVs>BrandX Product). The navigational hierarchy representing a categorization may be captured and associated with the extracted objects and their elements.

For brevity, certain instructions were omitted from Table 1. Tables 2-4 provide representative instructions for further web page families for e-shop.ca that may be read with Table 1. Table 2 below provides representative instructions, e.g. for lines 16 and 17 of Table 1, including instructions for a web page family related to a list of items/products for sale. Whereas the instructions at lines 22-28 of Table 1 provide product data extraction instructions for a web page family showing a single item (i.e. product), the instructions of Table 2 provide additional instructions that repeat product data extractions for each product in the list.

TABLE 2 XML Signature Schema Snippet for Product List Web Page Family of E-Shop.ca

1 <list_elements id="mylist_1">
2   <paging>
3     <page_variable value="page" />
4     <page_start value="0" />
5     <lookup type="pex" action="get_string" name="link" ref="Next&amp;nbsp" location="before" start="&lt;a class=" end="&lt;/a&gt;" include_sz="1" strip_tags="1" />
6   </paging>
7   <actions>
8     <lookup type="pex" action="move_ptr" ref="Sort or compare products" ref_alt_1="Sort products" />
9   </actions>
10   <element>
11     <lookup type="pex" action="get_string" name="link" ref="thumbnail" location="before" start="&lt;a href=&quot;" end="&quot;&gt;" />
12     <lookup type="pex" action="get_string" name="image" ref="thumbnail" location="middle" start="&quot;" end="&quot;" />
13     <lookup type="pex" action="get_string" name="title" ref="class=&quot;tx-strong-dgrey&amp;quot;" location="after" start="&lt;a href=" end="&lt;/a&gt;" include_sz="1" strip_tags="1" />
14     <lookup type="pex" action="get_string" name="price" ref="pricepill/" location="after" start="/" repeat_start="1" end=".gif" tolerance="1" />
15     <lookup type="pex" action="move_ptr" ref="pricepill/" />
16   </element>
17 </list_elements>

If the engine 140 identifies that the page is of the "mylist_1" family, the engine determines the location in the signature schema document that contains the signature for the objects and elements of that family and applies the instructions therefor. A product list at e-shop.ca may span multiple web pages. Instructions at lines 2-6 of Table 2 find the number of pages and generate the links for each of the pages. Instructions at lines 7-9 (actions tag) advance the search scan pointer to the region of web page code that may be of interest (i.e. in this case, the start of the list). In this way, a local signature reference can be used and any earlier ambiguous references skipped. Skipping to the local region of interest may also make the specification of the signature reference less complicated.

Taking advantage of inherent repeated patterns in the web page code, instructions at lines 10-16 (element tag) of Table 2 provide product data extraction instructions that may be repeated for each product in the list. The engine 140 may be provided with commands to scan for each data element of interest using a signature reference (e.g. ref=""), an action, one or more positional instruction(s) to further identify the data within the text of the web page code, and any additional text data manipulation instructions to extract the desired data (e.g. to remove HTML formatting characters or add characters). The instruction at line 15 moves the scan pointer to the end of the object (in this example a product in a list of products) to ready the instructions for application against the next object (product) in the list.

More particularly:

    -   lookup type="pex": string lookup
    -   action="get_string": returns a value back that is the desired element of the object.
    -   name="link": the object element, in this case the link to the product page
    -   ref="thumbnail": the reference string that identifies where to find the value of the link
    -   location="before": the value of the link is before the ref string
    -   start="&lt;a href=&quot;": look for the ref string after this value
    -   end="&quot;&gt;": look for the ref string before this value.
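The repeated extraction over a product list can be pictured with the short Python sketch below. It is illustrative only: the markup in `LISTING`, the signature strings and the function names are hypothetical, only "after" lookups are used, and the scan pointer is assumed to be advanced past the closing marker of each object so that the loop makes progress.

```python
from typing import Optional


def get_after(data: str, ref: str, start: str, end: str) -> Optional[str]:
    """Return the text between start and end, both searched for after ref."""
    ref_idx = data.find(ref)
    if ref_idx < 0:
        return None
    start_idx = data.find(start, ref_idx + len(ref))
    if start_idx < 0:
        return None
    value_from = start_idx + len(start)
    end_idx = data.find(end, value_from)
    return None if end_idx < 0 else data[value_from:end_idx]


def extract_products(data: str) -> list[dict[str, Optional[str]]]:
    """Apply the same element lookups to every product in the listing."""
    products = []
    while True:
        link = get_after(data, 'class="product"', '<a href="', '"')
        if link is None:  # no further product signature: the list is exhausted
            break
        title = get_after(data, "<a href=", ">", "</a>")
        price = get_after(data, 'class="price"', ">", "</span>")
        products.append({"link": link, "title": title, "price": price})
        # Advance the scan pointer past the end of the current object.
        end_of_object = data.find("</li>")
        if end_of_object < 0:
            break
        data = data[end_of_object + len("</li>"):]
    return products


# Hypothetical listing markup with a repeated per-product pattern.
LISTING = (
    "<ul>"
    '<li class="product"><a href="/p/1">TV One</a><span class="price">$199</span></li>'
    '<li class="product"><a href="/p/2">TV Two</a><span class="price">$249</span></li>'
    "</ul>"
)
```

Calling `extract_products(LISTING)` yields one dictionary per product, each produced by the same three lookups.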

TABLE 3 E-Shop Search Family Signature Schema Snippet

1 <search_elements id="mysearch_1">
2   <settings>
3     <search_path value="http://www.eshop.ca/search/search.asp" />
4     <search_variable value="keyword" />
5   </settings>
6   <paging>
7     <page_variable value="page" />
8     <page_start value="0" />
9     <lookup type="pex" action="get_string" name="link" ref="Next&amp;nbsp" location="before" start="&lt;a href=" repeat_start="1" end="&lt;/a&gt;" include_sz="1" strip_tags="1" />
10   </paging>
11   <actions>
12     <lookup type="pex" action="move_ptr" ref="bg-compare-hero" />
13   </actions>
14   <element>
15     <lookup type="pex" action="get_string" name="link" ref="&gt;" location="after" start="&lt;a href=&quot;" end="&quot;&gt;" />
16     <lookup type="pex" action="get_string" name="image" ref="&lt;a href" location="after" start="&lt;img src=&quot;" end="&quot;" />
17     <lookup type="pex" action="get_string" name="title" ref="class=&quot;tx-strong-dgrey&amp;quot;" location="after" start="&lt;a href=" end="&lt;/a&gt;" include_sz="1" strip_tags="1" />
18     <lookup type="pex" action="move_ptr" ref="bg-compare-hero" />
19   </element>
20 </search_elements>

If the engine 140 has identified that the page is of the "mysearch_1" family, the engine applies the portion of the signature schema document that contains the signature for the objects and elements of that family, shown above in Table 3.

-   <settings>...</settings>: Contains any web page specific manual overrides, such as excluding certain menu items, or any customization or modification of a menu that may be desired. In this example, as per lines 3 and 4, a value of form variable "keyword" will be posted to "http://www.eshop.ca/search/search.asp".
-   <paging>...</paging>: Manages paging for the search pages.
-   <actions>...</actions>: Instructs the engine to move the scan pointer to the string "bg-compare-hero" (line 12 of Table 3) and start looking for elements from there.
-   <element>...</element>: Contains lookup instructions for each object element as previously described.

TABLE 4 E-shop Menu Family Signature Schema Snippet

1 <menu_elements id="mymenu_1">
2   <settings>
3     <black_list value="Site Index##External Link" />
4   </settings>
5   <actions>
6     <lookup type="pex" action="move_ptr" ref="bg-lhsnav-title" />
7     <lookup type="pex" action="end_ptr" ref="&lt;/table&gt;" />
8   </actions>
9   <element>
10     <lookup type="pex" action="get_string" name="link" ref="&lt;li&gt;" location="after" start="&lt;a href=&quot;" end="&quot;" />
11     <lookup type="pex" action="get_string" name="title" ref="&lt;li&gt;" location="after" start="&lt;a href=&quot;" end="&lt;/a&gt;" include_sz="1" strip_tags="1" />
12     <lookup type="pex" action="move_ptr" ref="&lt;/li&gt;"/>
13   </element>
14 </menu_elements>

If the engine 140 has identified that it is looking for a menu on a page that contains the menu style of the "mymenu_1" family, the engine applies the portion of the signature schema document that contains the signature for the objects and elements of that family, shown above in Table 4.

-   <settings>...</settings>: Contains any page specific manual overrides such as exclude list, customization, modification, personalization, etc. In this example, as per line 3, any result that matches "Site Index" or "External Link" is excluded; partial matches are also possible by using wild card strings.
-   <actions>...</actions>: Lines 6-7 of Table 4 set the start and end limits to instruct the engine 140 where to look for menu items.
-   <element>...</element>: Contains lookup instructions for each object element as previously described. In this example, at lines 10 and 11 of Table 4, an element in "mymenu_1" (each individual menu entry of the web page) contains link and title as its properties. Line 12 instructs the engine to move the pointer to "&lt;/li&gt;" to get ready to loop through and extract the next menu item with the same elements, taking advantage of the repeated patterns within the text of the web page code.
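The start and end limits and the exclude list described above might be sketched as follows. The function names and the sample markup are hypothetical, only exact title matches are handled (the wild card matching mentioned above is not implemented), and the "##" delimiter follows the black_list value of Table 4.

```python
def bound_region(data: str, start_ref: str, end_ref: str) -> str:
    """Keep only the text from start_ref up to (not including) the next end_ref."""
    start = data.find(start_ref)
    if start >= 0:
        data = data[start:]  # move_ptr: discard what precedes the start limit
    end = data.find(end_ref)
    if end >= 0:
        data = data[:end]  # end_ptr: discard what follows the end limit
    return data


def parse_black_list(value: str) -> list[str]:
    """Split a '##'-delimited black_list value into its entries."""
    return [entry for entry in value.split("##") if entry]


def filter_menu(items: list[dict[str, str]], black_list_value: str) -> list[dict[str, str]]:
    """Drop menu items whose title exactly matches a black-listed entry."""
    blocked = set(parse_black_list(black_list_value))
    return [item for item in items if item["title"] not in blocked]


# Hypothetical page with a navigation table followed by an unrelated table.
MENU_PAGE = (
    '<div class="bg-lhsnav-title">Shop</div><table><li>TVs</li></table>'
    "<table><li>Footer</li></table>"
)
region = bound_region(MENU_PAGE, "bg-lhsnav-title", "</table>")
```

Bounding the region first keeps the element lookups from running into markup that merely resembles a menu, such as the second table in the sample.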

Though the example described relates to extracting informational content for an e-commerce oriented site, no limitation should be applied. Similar instructions may be defined for other types of sites, for pages which permit a user to input information and for navigational data extraction.

Signature schema document 122 may further comprise transcoding instructions (not shown) for use by engine 140 to express the extracted desired data (which may be retrieved from database 126) in a target format (e.g. a format of HTML, XML, script etc.) for use by the requesting client machine 102. For example, the transcoding instructions may define a web page for displaying the extracted data in browser application 86 that is suitable for display on the client machine 102. The formatting rules can be system and/or user defined and can include parameters such as but not limited to: object positioning, object colour, object size, object shape, object font/image characteristics, background style, and navigational item display (e.g. in a menu as described above) or for display with the content in the generated page on the client screen. Browser application 86 (e.g. of machine 102A) may be configured for using a markup language (e.g. cHTML) or other code format that is not identical to the code provided by web page 110. Alternatively, transcoding instructions may be defined to express the extracted desired data in XML or another code format such as for use by a different client application or plug-in to a client application such as menu application 82 or another application (not shown) on client machine 102.

Signature schema documents may be prepared (i.e. coded) using a computing device such as computing device 128. Computing device 128 may be any suitable desktop or laptop device capable of coding documents (which may be but need not be XML-type documents) and may be configured to automate or semi-automate coding of such documents.

Computing device 128 may be coupled to web site 104 to retrieve web pages from the site for review in preparing the custom signature schema document for the site. Computing device 128 may be configured to automatically review the web page code and apply heuristics or other techniques (e.g. spatial analysis) to determine probable content of interest (i.e. desired data) and generate code to extract the desired data. For example, primary content of interest tends to be located toward the centre of the web page. In another embodiment, the computing device may assist a user who is coding a signature schema with the manual analysis of the web page, the identification of desired data and the generation of the instructions. Computing device 128 may be further coupled to repository 124 to provide (e.g. up-load or publish) coded signature schema documents for use by server 120.

It will be apparent to a person of ordinary skill in the art that, because a web site may be re-designed or otherwise changed such that the code of one or more web page families is changed or a family is added, an existing signature schema may require re-coding to account for the change or addition, as applicable.

Signature (Transcoding) Engine Syntax

In accordance with a present embodiment, further details concerning the syntax of schema instructions are described.

Lookup Syntax

-   The lookup tag instructs the engine 140 to insert into, delete from or query the document contents.
-   Type: Defines the data type of the lookup. Type may be "pex" for a string expression. Type may also support more advanced options such as regular expressions, API calls, and SQL queries.

Action:

-   Action="locate_string": Look for a string (the "ref" identifier) value within the data. Return true if and only if the string exists in the data (i.e. the "ref" identifier index>=0).
-   Action="replace_string": Replace a string within the data with the "ref" identifier.
-   Action="move_ptr": Remove all characters in the data that exist before the location of the "ref" identifier.
-   Action="end_ptr": Remove all characters in the data that exist after the location of the "ref" identifier.
-   Action="get_string": Extract a string based on the location of the "ref", "start", and "end" identifiers.
-   ID: ID is an identifier of another section within the signature. It allows the result of a query to trigger another set of actions within the signature. This is primarily used when identifying page types. Once a match has been made, specific instructions are executed that are marked with this ID. Recursive data structures (e.g. lists within lists) may also be supported.
-   Ref: Ref defines the initial identifier that the lookup searches for. If an AND case is required, multiple ref identifiers can be used (i.e. ref="string1" ref1="string2"). If an OR case is required, ref_[ref identifier]_alt_1 can be used (i.e. ref="string1" ref_alt_1="string2"). To demonstrate, (X="1"||Y="2") && (A="8"||B="9") would translate to ref="1" ref_alt_1="2" ref1="8" ref1_alt_1="9".
-   Repeat_[identifier]: Repeat executes the identifier query additional times. For example, if ref="hello", to set the identifier index at the second occurrence of hello the following tag would be added: repeat_ref="1".
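The AND/OR combination of ref identifiers can be read as a conjunction of alternative groups. The Python below is a sketch of that reading only: `ref_groups` and `locate_string` are hypothetical helper names, and the attribute dictionary stands in for the parsed attributes of a lookup tag.

```python
import re


def ref_groups(attrs: dict[str, str]) -> list[list[str]]:
    """Group ref/refN attributes (AND-ed) with their _alt_K variants (OR-ed)."""
    groups: dict[str, list[str]] = {}
    for key, value in attrs.items():
        match = re.fullmatch(r"(ref\d*)(?:_alt_\d+)?", key)
        if match:  # attributes such as repeat_ref or action are ignored
            groups.setdefault(match.group(1), []).append(value)
    return list(groups.values())


def locate_string(data: str, attrs: dict[str, str]) -> bool:
    """True when every group has at least one alternative present in data."""
    groups = ref_groups(attrs)
    return bool(groups) and all(
        any(alternative in data for alternative in group) for group in groups
    )


# (X="1" || Y="2") && (A="8" || B="9") from the Ref description above.
ATTRS = {"ref": "1", "ref_alt_1": "2", "ref1": "8", "ref1_alt_1": "9"}
```

With `ATTRS`, a page containing "2" and "9" satisfies the lookup, whereas a page containing only "1", or only "8" and "9", does not.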

Location:

-   Location="before": Search the data in a reverse direction, starting from the "ref" identifier. This implies that both the "start" and "end" identifier indexes must be less than the "ref" index.
-   Location="middle": Search the data in two directions, starting from the "ref" identifier. This implies that the "ref" identifier index is greater than the "start" identifier index and less than the "end" identifier index.
-   Location="after": Search the data in a forward direction, starting from the "ref" identifier. This implies that both the "start" and "end" identifier indexes must be greater than the "ref" index.
-   Start: Start is primarily used when action="get_string" and may also be used for replace/remove instructions. The start identifier index will be the start index of the string to extract. If an AND case is required, multiple "start" identifiers can be used (i.e. start="string1" start1="string2"). If an OR case is required, start_[start identifier]_alt_1 can be used (i.e. start="string1" start_alt_1="string2"). To demonstrate, (X="1"||Y="2") && (A="8"||B="9") would translate to start="1" start_alt_1="2" start1="8" start1_alt_1="9". To find the nth match see the repeat syntax.
-   End: End is primarily used when action="get_string" and may also be used for replace/remove instructions. If an AND case is required, multiple "end" identifiers can be used (i.e. end="string1" end1="string2"). If an OR case is required, end_[end identifier]_alt_1 can be used (i.e. end="string1" end_alt_1="string2"). To demonstrate, (X="1"||Y="2") && (A="8"||B="9") would translate to end="1" end_alt_1="2" end1="8" end1_alt_1="9". To find the nth match see the repeat syntax.
-   Max_index: Max_index is used to limit the scope of a query by ensuring that no other identifier index is greater than the "max_index". If an AND case is required, multiple "max_index" identifiers can be used (i.e. max_index="string1" max_index1="string2"). If an OR case is required, max_index_[max_index identifier]_alt_1 can be used (i.e. max_index="string1" max_index_alt_1="string2"). To demonstrate, (X="1"||Y="2") && (A="8"||B="9") would translate to max_index="1" max_index_alt_1="2" max_index1="8" max_index1_alt_1="9". To find the nth match see the repeat syntax.
-   Max_Index_Use_Ref: Max_Index_Use_Ref is a Boolean value set to 0 or 1. It is used with Max_index. When set to 0, the "max_index" will begin querying at the beginning of the data. When set to 1, the "max_index" will begin querying from the "ref" identifier index.
-   Gbl_append_[identifier]: Gbl_append appends a string passed via the url to the identifier's query value.
-   Gbl_Repeat_[identifier]: Gbl_Repeat executes the identifier query additional times. For example, if ref="hello", to set the identifier index at the second occurrence of hello the following tag would be added: gbl_repeat_ref="var" where var would be passed in the URL, i.e. http://www.eshop.ca/mobile/fatfree.asp?site=...&url=...&var=1.
-   Tolerance: Tolerance is a Boolean value set to 0 or 1. It is used to return an empty string. By default tolerance is set to 0, which enforces that a property be found on a page; otherwise the page will be marked as "invalid" and an appropriate error message returned. When set to 1, an empty value is returned for properties that cannot be located.
-   Include_sz: Include_sz is a Boolean value set to 0 or 1 and used with get_string. It is by default set to 0. When set to 1 it includes the "start" value and the "end" value as part of the result.
-   Include_start: Include_start is a Boolean value set to 0 or 1 and used with get_string. It is by default set to 0. When set to 1 it includes the "start" value as part of the result.
-   Include_end: Include_end is a Boolean value set to 0 or 1 and used with get_string. It is by default set to 0. When set to 1 it includes the "end" value as part of the result.
-   Closetag: Closetag is a Boolean value set to 0 or 1 and used when action="get_string". It appends /> to the extracted value.
-   Strip_Tags: Strip_Tags removes HTML tags from the value and is used when action="get_string".
-   Strip_tags="1": remove all tags.
-   Strip_tags="2": remove all br and script tags.
-   Strip_tags="3": remove all tags except replace </p> </li> with <br>.
-   Strip_tags="4": remove all tags except replace </div> <br> with <br>.
-   Strip_tags="tag1,tag2...tagN": remove all tag1, tag2,...tagN leaving any tag not listed.
-   Notrim: Notrim is a Boolean value set to 0 or 1 and used when action="get_string". By default all values have white space trimmed. When this property is set to 1, white space is not trimmed.
-   Append: Append is a string value and used when action="get_string". It appends a string to the extracted value.
-   Prepend: Prepend is a string value and used when action="get_string". It prepends a string to the extracted value.
-   Upper: Upper is a Boolean value set to 0 or 1 and used when action="get_string". It converts all characters to upper case.
-   Lower: Lower is a Boolean value set to 0 or 1 and used when action="get_string". It converts all characters to lower case.
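To show how a few of these options might combine, the Python sketch below implements a simplified get_string for location="after" with include_sz, strip_tags="1", tolerance and notrim. It is an assumption-laden illustration rather than the engine's implementation: the function signature, the regular expression used for tag removal and the sample markup in `ROW` are all hypothetical.

```python
import re


def get_string(data: str, ref: str, start: str, end: str, *,
               include_sz: bool = False, strip_tags: bool = False,
               tolerance: bool = False, notrim: bool = False) -> str:
    """Simplified get_string for location="after" with a few post-processing options."""
    ref_idx = data.find(ref)
    start_idx = data.find(start, ref_idx + len(ref)) if ref_idx >= 0 else -1
    end_idx = data.find(end, start_idx + len(start)) if start_idx >= 0 else -1
    if end_idx < 0:
        if tolerance:  # tolerance=1: a missing property yields an empty value
            return ""
        raise LookupError(f"property not found for ref {ref!r}")
    if include_sz:  # keep the start and end strings as part of the result
        value = data[start_idx:end_idx + len(end)]
    else:
        value = data[start_idx + len(start):end_idx]
    if strip_tags:  # strip_tags=1: remove every tag
        value = re.sub(r"<[^>]*>", "", value)
    return value if notrim else value.strip()


# Hypothetical markup resembling the price lookup at line 25 of Table 1.
ROW = '<tr><td>our price:</td><td class="price"> $199.99 </td></tr>'
price = get_string(ROW, "our price:", "<td", "</td>", include_sz=True, strip_tags=True)
```

One plausible reason for pairing include_sz with strip_tags, as the tables do, is visible here: the start string "<td" is only part of a tag, so including it lets the tag remover delete the whole opening tag instead of leaving its attributes in the value.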

Page Syntax

-   The page syntax extracts the paging information from the data. This gives the end user the ability to change pages just as on the desktop.
-   Page_variable: Defines the unique key that defines a family's paging feature.
-   Page_start: Defines the value of the first page in a family's paging feature.
-   Page_post: Path to which the paging variable(s) must be transmitted.
-   Page_increment: Defines the value that paging increases by for each page in a family's paging feature.
-   Page_block: Defines the unique key that defines a family's paging block feature.
-   Page_block_size: Defines the size of the family's page block (i.e. 10 items per page).
-   Url_append: Append the unique key that defines a family's paging feature and the page number.
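One way the paging values could be combined into page links is sketched below in Python. How the engine actually builds these links is not detailed here, so the `page_url` function, its parameters and the `list.asp?cat=tv` path are assumptions for illustration.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_url(base_url: str, page_variable: str, page_start: int,
             page_increment: int, page_number: int) -> str:
    """Return the URL of the page_number-th page (0-based) of a listing."""
    parts = urlsplit(base_url)
    query = dict(parse_qsl(parts.query))
    # The first page carries page_start; each later page adds page_increment.
    query[page_variable] = str(page_start + page_increment * page_number)
    return urlunsplit(parts._replace(query=urlencode(query)))


# Hypothetical listing URL; page_variable="page" and page_start=0 as in Table 2.
third_page = page_url("http://www.eshop.ca/list.asp?cat=tv", "page", 0, 1, 2)
```

Existing query parameters are preserved and only the paging variable is set or replaced.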

Search Syntax

-   The search syntax makes a web site family's search feature functional by specifying details such as what variable to post.
-   Search_path: Search path to which the search variable must be transmitted.
-   Search_variable: Name of the search variable which a web site's search feature is looking to read, request, post, etc.
-   Url_replace: Remove a portion of the url that is specific to posting search parameters.

URL Syntax

-   The url tag defines global properties for a site, including the url and name:
-   <url location="http://www.eshop.ca" key="eshop.ca" name="E-Shop"/>
-   Name: Name is the name to display when browsing using the gateway 120.
-   Location: Location defines the fully qualified address of the site.
-   Key: Key identifies the site (e.g. "eshop.ca").

Advanced Syntax

-   The advanced tag defines global properties for the site. This at a minimum includes the path to the initial page of the site.

<advanced>
  <index_link value="http://www.eshop.ca" />
  <check_out value="1" />
</advanced>

-   Index_link: Index_link specifies the path to the initial page of the site. This is usually the same page as the location property from the URL syntax. This field is always required.
-   Append_link: Appends a string value to every URL requested for this site.
-   No_purchase: No_purchase is a Boolean value 0 or 1. The default value is 0, which implies that an item should contain a purchase link. When true, the purchase link is removed.
-   No_item: No_item is a Boolean value 0 or 1. The default value is 0, which implies that item pages should show up in the breadcrumb. When true, the item is not added to the breadcrumb.
-   Check_out: Check_out is a Boolean value 0 or 1. The default value is 0, which implies that the item purchase link sends the request and control away from the gateway server 120. When true, a checkout process has been created for use with gateway server 120.
-   Product_img_width: Product_img_width defines the width of all item images.
-   Use_cookies: Use_cookies is a Boolean value 0 or 1. By default it is set to 0, and cookies are not passed to the site. When true, gateway 120 passes all cookies from client machine 102 to the site 104, and from the site 104 to the client machine.

Page Type Syntax

-   The page type is a collection of lookup queries that have an id associated with them. Lookup queries may be processed in a top down fashion. The first successful lookup will trigger another section in the signature schema document. For example, if the following evaluates to true:

<page_type>
  <lookup type="pex" action="locate_string" name="list_elements" id="mylist_1" ref="&lt;!--" />
</page_type>

Then the tag element <list_elements id="mylist_1"> would be executed next.

General Element Syntax

-   Elements include list_elements, menu_elements, item_elements, search_elements, form_elements. Each element has an ID. For example, a menu element:
-   <menu_element id="menu_id"/>
-   The element may contain the following sub containers (settings, actions, elements, paging), whose scope resides only within the element. Each element is associated with a specific rendering function.

<menu_element id="menu_id">
  <settings> </settings>
  <paging> </paging>
  <elements> </elements>
  <actions> </actions>
</menu_element>

Settings Syntax

-   Settings syntax varies based on the type of element it resides in. Settings allow customizations that only apply to a specific page family.
-   Black_list-menu_elements: Black_list removes menu items with names that reside in the black list. Entries are delimited (e.g. using two pound characters (##)).
-   Pass_image-list_elements, search_elements: Pass_image adds the image path to the url when requesting an item. The image added to the url will be used as the item image.
-   Price[n]-item_elements: Price[n], where n is an integer, renames the rendered item with name price[n].
-   Action-form_elements: Overrides the action of a form displayed to the end user.
-   Handle-form_elements
-   Handle="display": display the form to the end user.
-   Handle="post": post the form.
-   Handle="get": get the form.
-   Cookie-form_elements: Send additional cookies when posting this form.
-   Input_[identifier]-form_elements: Input tag adds/modifies a form value with name [identifier], setting its value.
-   Rename_[identifier]-form_elements: Rename tag renames a form value with name [identifier].

Actions Syntax

-   The primary function of the actions tag is data manipulation. It contains lookup queries that modify data with actions of "move_ptr" or "end_ptr".

<actions>
  <lookup type="pex" action="move_ptr" ref="&lt;/head&gt;" />
</actions>

Persons of ordinary skill in the art will appreciate that alternative embodiments are contemplated. System 100 may be implemented so that one or more web sites are coupled to the telecommunication network (either alone by a server 106 or by one or more web servers like web-server 106), and that a corresponding one or more schemas for each of those web sites (or each of the web pages therein, or both) can be maintained by gateway and schema server 120 and repository 124. Client machines 102 can be configured for proxied connection through different servers 120 and for accessing aggregated web site data from database 126. Those skilled in the art will now further recognize that server 120 and web server 125 can be hosted by a variety of different parties, including, for example but without limitation: a manufacturer of client machine 102; a service provider that provides access to the telecommunication network on behalf of user U of a client machine 102; the entity that hosts web-site 104; or a third party intermediary. In the web site host example, it may even be desirable simply to combine the web server 106 and schema server engine 120 on a single server to thereby obviate the need for separate servers. Alternatively, the functionality of server 120 and web server 125 may be locally resident on the client machine.

FIG. 4 is a schematic representation of an aggregate web site search database 126. The database 126 contains one or more tables for storing aggregate web page data extracted from target web sites. The database may be a relational database that enables structured database queries and provides temporary or persistent storage of structured data; it may be indexed, in whole or in part, for fast performance. For an e-commerce web site, the database 126 may contain tables defining a web site identification and category index 402 and category data 404 containing specific item details, as will be discussed in connection with FIG. 5. The category data 404 may be further divided into product data 406 sub-tables to store additional product related information. Indices 408 can be created to reference aggregate web site data to improve query responsiveness and results. Metadata 410 associated with the original web pages, such as images, web page formatting or navigation data, may be stored. In other web site applications such as news, weather, stock data, patent data, trade-mark data etc., the appropriate tables would be configured to store the extracted data based upon the signature schema. The database 126 may alternatively reside in client machine 102A memory. The client machine would generate the request to the desired web sites and store extracted data locally.

FIG. 5 is a schematic representation of tables in an aggregate web site search database for e-commerce. The extracted data from selected e-commerce web sites can be formatted into category index 402, which indicates the identification 502 of the web site from which the data was extracted and the category 504 of interest. In this example, entry 506 identifies http://www.eshop2.ca/pg=3 as the reference for where digital cameras may be indexed. For each entry in the category index 402, an individual category 404 table can be created. The category table provides location identifiers 510 for each product retrieved from the web site based upon the defined signature schema and is populated at step 310 during the retrieval of data from the web site. The vendor 512, title 514 of the product, price 516 and description 518 extracted can then be stored. It should be understood that the categories identified can be tailored to the application. The database then provides a means for querying the aggregate data from the web site to present meaningful information to the client machine 102A. By storing the aggregate data, queries can be created to meet desired requirements and transcoded for presentation on the client machine 102A.
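A relational layout along the lines of FIG. 5 might look like the following Python sketch using the standard sqlite3 module. The table and column names are hypothetical, and one simplification is assumed: instead of creating an individual category table per index entry, a single category_data table references the category index by id.

```python
import sqlite3


def create_tables(conn: sqlite3.Connection) -> None:
    """Create a category index and a product table keyed by category."""
    conn.executescript("""
        CREATE TABLE category_index (
            id       INTEGER PRIMARY KEY,
            site_url TEXT NOT NULL,   -- identification of the source web site
            category TEXT NOT NULL    -- category of interest
        );
        CREATE TABLE category_data (
            id          INTEGER PRIMARY KEY,
            category_id INTEGER NOT NULL REFERENCES category_index(id),
            location    TEXT NOT NULL,  -- where the product was found
            vendor      TEXT,
            title       TEXT,
            price       REAL,
            description TEXT
        );
    """)


def store_product(conn: sqlite3.Connection, site_url: str, category: str,
                  product: dict) -> int:
    """Insert a product, creating its category index entry when needed."""
    row = conn.execute(
        "SELECT id FROM category_index WHERE site_url = ? AND category = ?",
        (site_url, category),
    ).fetchone()
    if row is None:
        cursor = conn.execute(
            "INSERT INTO category_index (site_url, category) VALUES (?, ?)",
            (site_url, category),
        )
        category_id = cursor.lastrowid
    else:
        category_id = row[0]
    conn.execute(
        "INSERT INTO category_data "
        "(category_id, location, vendor, title, price, description) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (category_id, product["location"], product["vendor"], product["title"],
         product["price"], product["description"]),
    )
    return category_id
```

Keeping the site and category in an index table means products extracted from different web sites can share one product table while remaining traceable to their source.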

FIG. 6 illustrates a method of creating the aggregate web site search database. The web sites 103, 104 and 105 that are to be indexed are identified, and page request(s) 602 are sent to the web site to fetch pages by data collection engine 150 hosted on web server 125. The data collection engine 150 is a web crawler to independently collect data from target web sites for storage in aggregate database 126. Alternatively, the data collection engine 150 may be included in the engine 140 of gateway 120. The signature schema for the web site is retrieved 604 from repository 124 by engine 140. The schema is applied 606 to the received web site data to extract relevant data from the web page content. The extracted data is stored 608 to the aggregate database 126 in the appropriate tables. The process is repeated for each relevant web site to generate the aggregate web site search database. The data collection engine 150 can then periodically access web sites 103, 104 and 105 to ensure stored data is accurate and up to date. As the schema is defined to extract elements with which objects and their attributes on the web page can be defined or described and the schema incorporates knowledge of what these objects and attributes represent, an intelligent and indexed database 126 can be defined.
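The request, retrieve, apply and store steps of FIG. 6 can be outlined as a small Python loop. This is a structural sketch only: the `aggregate` function and its four callable parameters are hypothetical stand-ins for the data collection engine, the repository, the engine and the database, so that no network or storage details are assumed.

```python
from typing import Callable, Iterable

Record = dict[str, str]


def aggregate(site_urls: Iterable[str],
              fetch_page: Callable[[str], str],
              load_schema: Callable[[str], dict],
              apply_schema: Callable[[dict, str], list[Record]],
              store: Callable[[str, Record], None]) -> int:
    """Run the fetch, schema lookup, extraction and storage steps for each site."""
    stored = 0
    for url in site_urls:
        page = fetch_page(url)                      # page request (602)
        schema = load_schema(url)                   # retrieve signature schema (604)
        for record in apply_schema(schema, page):   # apply schema (606)
            store(url, record)                      # store extracted data (608)
            stored += 1
    return stored
```

Passing the four steps in as callables keeps the loop testable with stubs and leaves open whether the collection engine runs on the web server or inside the gateway.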

FIG. 7 illustrates a method of querying the aggregate web site search database 126. A search query or request is generated by a user via an interface on client machine 102A and received 702 at gateway 120, or may be directed to the aggregate web site 150. The web site 150 processes the request and generates 704 the relevant database query, such as an SQL (Structured Query Language) query, to database 126. The relevant data is then retrieved 706 from the database 126. The retrieved data, or search results, can then be formatted to accommodate client machine 102A characteristics and provided 708 to the client machine 102A in response to request 702. If the request is made through a web application, the search result data may be formatted to accommodate the client machine browser, for example as shown in FIG. 7C. Alternatively, the data may be pushed to a push application on the client machine 102A. The web site data may be retrieved periodically by data collection engine 152, which may be triggered manually, scheduled or on-going.
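The query flow of FIG. 7 (receive 702, generate 704, retrieve 706, provide 708) can be illustrated with a short sketch. This is an assumption-laden example rather than the disclosed implementation. It assumes a product table with category, title and price columns. The function names and the max_title_chars parameter, which stands in for client machine characteristics, are hypothetical.

```python
import sqlite3


def build_query(category, max_price=None):
    """Step 704: turn a search request into a parameterized SQL query."""
    sql = "SELECT title, price FROM product WHERE category = ?"
    params = [category]
    if max_price is not None:
        sql += " AND price <= ?"
        params.append(max_price)
    sql += " ORDER BY price"
    return sql, params


def search(conn, category, max_price=None):
    """Step 706: retrieve matching rows from the aggregate database."""
    sql, params = build_query(category, max_price)
    return conn.execute(sql, params).fetchall()


def format_results(rows, max_title_chars):
    """Step 708: shorten titles to suit a small client display."""
    return [f"{title[:max_title_chars]}: ${price:.2f}" for title, price in rows]
```

Passing user-supplied values as query parameters, rather than concatenating them into the SQL text, keeps the generated query safe against malformed input.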

1. A method of aggregating web site data from one or more web sites, the method comprising: sending a page request to a web site selected from the one or more web sites; receiving the requested web page from the selected web site; retrieving signature schema associated with the requested web page wherein the signature schema identifies data fields within the requested web page; applying signature schema to the requested web page to extract data from the requested web page; and storing extracted data to an aggregate database, wherein the aggregate database comprises data extracted from the one or more web sites.
 2. The method of claim 1 further comprising: receiving a search query from a client machine for data stored in the aggregate database; generating a database query based upon the received search query; and retrieving data from the aggregate database defined by the database query.
 3. The method of claim 2 further comprising: generating a retrieved data web page for the client machine using the retrieved data; and providing the retrieved data web page to the client machine.
 4. The method of claim 2 wherein the search query is generated by a push application on the client machine and wherein the retrieved data is sent to the push application.
 5. The method of claim 1 wherein the data is stored in a relational database.
 6. The method of claim 5 wherein the database query is a Structured Query Language (SQL) query.
 7. The method of claim 1 wherein the signature schema is retrieved from a repository of one or more signature schemas, wherein each schema is defined for each of the one or more web sites.
 8. The method of claim 1 wherein the signature schema comprises eXtensible Markup Language (XML) documents comprising query language for extracting data from the requested web page.
 9. The method of claim 1 wherein the aggregate database comprises a category table identifying a hypertext transport protocol (HTTP) link and a product category and wherein the aggregate database further comprises a product table identifying a product specific HTTP link, product information and pricing for each identified product category.
 10. The method of claim 1 wherein the aggregated database is stored on the client machine.
 11. The method of claim 1 wherein the client machine is a wireless device.
 12. A system for aggregating web site data from one or more web sites, the system comprising: at least one computing device comprising a processor and a memory coupled thereto, said memory storing instructions and data for configuring the processor to: send a page request to a web site selected from the one or more web sites; receive a web page from the selected web site based upon the sent page request; retrieve signature schema associated with the requested web page; apply signature schema to the requested web page data to extract data identified by the signature schema; and store extracted data to an aggregate database comprising data extracted from the one or more web sites.
 13. The system of claim 12 wherein the instructions for configuring the processor are further configured to: receive a search request for data from the aggregate database from a client machine; generate a database query based upon the received search request; and retrieve data from the database defined by the database query.
 14. The system of claim 12 wherein the signature schema is retrieved from a repository of one or more signature schemas wherein each schema is defined for each of the one or more web sites.
 15. The system of claim 12 wherein the signature schema are eXtensible Markup Language (XML) documents comprising query language for extracting data from the requested web page.
 16. The system of claim 12 wherein the aggregate database comprises a category table identifying a hypertext transport protocol (HTTP) link and a product category and wherein the aggregate database further comprises a product table identifying a product specific HTTP link, product information and pricing for each identified product category.
 17. The system of claim 12 wherein the data is stored in a relational database.
 18. The system of claim 17 wherein the query is a Structured Query Language (SQL) query.
 19. The system of claim 12 wherein the processor is further configured to: generate a retrieved data web page for the client machine using the retrieved data; and provide the retrieved data web page to the client machine.
 20. The system of claim 12 wherein the search query is generated by a push application on the client machine and the retrieved data is sent to the push application.
 21. The system of claim 12 wherein the aggregated database is stored on the client machine.
 22. The system of claim 21 wherein the client machine is a wireless device.
 23. A computer program product storing computer readable instructions which when executed by a computer processor configure the processor for: sending a page request to a web site selected from the one or more web sites; receiving the requested web page from the selected web site; retrieving signature schema associated with the requested web page wherein the signature schema identifies data fields within the requested web page; applying signature schema to the requested web page to extract data from the requested web page; and storing extracted data to an aggregate database, wherein the aggregate database comprises data extracted from the one or more web sites.
 24. A method of aggregating web site data from one or more web sites, the method comprising: sending a page request to a web site selected from the one or more web sites; receiving the requested web page from the selected web site; retrieving signature schema associated with the requested web page wherein the signature schema identifies data fields within the requested web page and wherein the signature schema are eXtensible Markup Language (XML) documents comprising query language for extracting data from the requested web page; applying signature schema to the received web page to extract data from the requested web page; storing extracted data to an aggregate database, wherein the aggregate database comprises data extracted from the one or more web sites; receiving a search query from a client machine for data stored in the aggregate database; generating a database query based upon the received search query; and retrieving data from the aggregate database defined by the query.
 25. The method of claim 25 wherein the client machine is a wireless device.